Computing the smallest eigenvalue of large ill-conditioned 

Hankel matrices 



Niall Emmart^, Charles C. Weems^, and Yang Chen^ 

^'^ Computer Science Department 
University of Massachusetts 
Amherst, MA 01002, USA. 
E-Mail: nemmartOgmail . com and weemsScs .umass . edu 

^Department of Mathematics 
Imperial College 
180 Queen's Gate 
London SW7 2BZ, UK 
E-Mail: y.chen@ma.ic.ac.uk 



Abstract 

This paper presents a parallel algorithm for finding the smallest eigenvalue of a particu- 
lar form of ill-conditioned Hankel matrix, which requires the use of extremely high precision 
arithmetic. Surprisingly, we find that commonly-used approaches that are designed for high 
efficiency are actually less efficient than a direct approach under these conditions. We then de- 
velop a parallel implementation of the algorithm that takes into account the unusually high cost 
of individual arithmetic operations. Our approach combines message passing and shared mem- 
ory, achieving near-perfect scalability and high tolerance for network latency. We are thus able 
to find solutions for much larger matrices than has been previously possible, with the potential 
for extending this work to systems with greater levels of parallelism. 



Introduction 

In this paper, we study the problem parallelizing the computation of the smallest eigenvalue of a 
family of extremely ill-conditioned large Hankel matrices using a Beowulf cluster. The matrices are 
N hy N and are dense and symmetric, given by the formula: 

^^''^ = l^ C^p^^ ) j = °' ^-^^ 

Where /? and are the parameters that determine M and therefore its eigenvalues. 



^This work is in supported in part by the National Science Foundation under NSF grant #CNS-0619337. Any 
opinions, findings conclusions or recommendations expressed here are the author(s) and do not necessarily reflect 
those of the sponsors. 
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The gamma function, T(x), is a continuous extension of the factorial function, namely, T{n) = 
(n — 1)! when n € N. The following are two example matrixes for N = A, f] = 1 and = 4, /? = |: 



0! 1! 2! 3! 

1! 2! 3! 4! 

2! 3! 4! 5! 

3! 4! 5! 6! 



and 



fr(f) |r(f) 

|r(f) fr(^) 

|r(f) fr(^) 

|r(f) |r(f) 



fr(^) 
fr(f) 
fr(f) 



|r(f; 
fr(f: 
fr(f; 
fr(f; 



The matrices are Hankel moment matrices with the weight function i(;(x)=exp(— x^). 

This work is a follow on to previous work by Yang Chen and Nigel Lawrence jl] , who investigated 
the asymptotic behavior of the smallest eigenvalue of M as N ^ oo. In the numeric portion of their 
paper, they were able to compute the smallest eigenvalue for matrices up to size 300 by 300. Their 
research showed the problem required both extreme precision and a large amount of computation. 
Parallel computing capability has increased substantially since their initial work in 1999, but even 
today we estimate it would take over 250 hours to solve a medium size 1000 by 1000 instance on a 
uniprocessor and much longer for large matrices. Clearly an efficient parallel solution is needed. 

Unfortunately, because M is so ill-conditioned, it cannot be solved using standard double preci- 
sion arithmetic. Neither LAPACK nor the parallel ScaLAPACK offer sufficient precision. Instances 
beyond 25 by 25 can only be solved using an extended precision arithmetic package. For large ma- 
trices, the intermediate computations must be carried out with over 20,000 bits of precision. 

The enormous precision required for the intermediate computations presents opportunities and 
challenges. It shifts the balance of computation to communication by requiring much more com- 
putation for each scalar operation on the matrix elements. It also means that the amount of main 
memory needed to store the matrix is massive. The extended precision computations also affect 
cache locality. Therefore the array elements must be more finely partitioned among the nodes than 
normal, with respect to traditional heuristics for the blocking factors of large matrix computations. 

This paper explores various algorithms to solve the problem with the goal of maximizing the 
delivered performance from a Beowulf cluster. 



Hankel matrices, the moment problem and smallest eigenvalues 

Hankel matrices occur naturally in moment problems - for a given sequence of moments (mean, 
variance, skewness, kurtosis, etc), determine a probably distribution/measure that gives rise to the 
sequence of moments. See, for example, the monographs by Akhiezer [l] and by Krein [9] for in- 
depth studies. Moment problems are encountered in many fields from statistics to quantum physics 
to hydrology. They are classified according to the support of the distribution/measure (the set of 
points where the distribition/measure is non-zero). If the support is a closed interval (the Hausdorff 
moment problem) there will always be a unique solution. If the support is the whole number line 
(the Hamburger moment problem), then one can encounter a situation where the problem is said 
to be indeterminate, that is there are infinitely many probability distributions/measures with the 
same sequence of moments. 

Hankel matrices also appear in random matrix theory, see Mehta [10], a currently active research 
area that encompasses pure and applied mathematics and has applications in wireless communica- 
tions and multi-variate statistics. See for example the monograph by Kerov [8j which studies the 
connections between representation theory, moment problems and random matrices. 
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It was found in the classical papers by Szego [13] and by Widom and Wilf [15] that for those 
Hankel matrices (of order n) generated by a probability density supported on a closed interval, the 
corresponding smallest eigenvalue tends to zero as n tends to infinity. 

In [2] Berg, Chen and Ismail showed that in the infinite interval case, the moment problem 
will be indeterminate if and only if the smallest eigenvalue of the Hankel matrix tends to a strictly 
positive number as n tends to infinity. This is a new criteria for the determinancy of the Hamburger 
moment problem and hence our motivation for fast algorithms to compute the smallest eigenvalues 
of large Hankel matrices. 

Properties of M 

The algorithms we will explore in this paper rely on three properties of M: 

• M is real and symmetric, therefore, all the eigenvalues are real. 

• The eigenvalues of M are the zeros of the characteristic polynomial P{x) = det(M — xl). 

• M is the Hankel moment matrix for the weight function w{x)=eyip{—x^), therefore, there are 

distinct positive eigenvalues of M. 

The last point is very important for convergence of the Secant algorithm and can be proven 
as follows. Since w{x)=ex.p{—x^), we know that for any real polynomial Q{x) of degree A^, the 
moment integrals {i = 0, 1, 2, ...) will all exist: 

POO 

/ij := / x'^w{x)dx 
Jo 

further, since 

POO 

/ [Q{x)fw{x)dx > 
Jo 

the associated symmetric Hankel matrix Hn '■= {l^i+j)fj=o^ '^i^^ ^e positive definite and the 
eigenvalues will be distinct for any fixed A^. See Matrix Analysis, Horn and Johnson, Cambridge 
University Press, 1985 [7]. 



M is extremely ill-conditioned 

The condition number of a matrix is a good indication of how much precision is needed to compute 
the smallest eigenvalues. The larger the condition number the greater the required precision. 
The condition number is defined to be: 

cond{M) = ^^j^jj A AT is the largest eigenvalue and Ai is the smallest 

Condition numbers can be estimated using the computation of Ai and by bounding A at as 
follows. The Raleigh quotient function, p{u;M) ranges over the interval [Ai,A7v] for non-zero n, 
therefore: 
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, iMu,u) 
p[u; M) = — ^ < Aat For all non-zero u 



{u,u) 



Choosing u to be the column vector (0, 0, 0, 1): 

piu;M) = 

Further, Xn < trace(M), therefore: 



(Mn,n) _ 1 /2iV- 1 

{u,u) ~ p^y p 



1 /2iV-l 



/(^j<A.<trace(M) = -r(^ 



1 /2iV-l 



/fir-'- 

fc=0 



P V p 



For large N, 4r( ^ j is a good estimate for Aat because Ylk=o ^^i^^^w^) is very small 



compared to ^F^^^^^. 

The following table of condition numbers has been calculated using the lower bounds for Aat 
and the results of the Ai computations: 



LOWER BOUND ON COND(M) 



EXPONENTIAL FUNCTIONS 



N 




/3=l/2 


/3-1 


/3=7/4 


2^ 


iV! 




100 


8.52 el396 


7.36 6861 


9.40 e384 


1.94 6228 


1.27 630 


9.33 6157 


1.00 6200 


300 


5.17 e5066 


4.65 63167 


6.38 C1429 


3.55 6819 


2.04 690 


3.06 6614 


1.37 6743 


500 


4.56 e9116 


6.89 65726 


3.62 c2597 


4.11 61472 


3.27 6150 


1.22 61134 


3.05 61349 


1000 


1.85 620050 


6.97 612663 


7.62 65780 


6.80 63240 


1.07 6301 


4.02 62567 


1.00 63000 


1500 


1.11 631666 


3.81 620055 


8.45 69187 


3.32 65125 


3.51 6451 


4.81 64114 


1.37 64764 



As can be seen from this table, in each case the condition number is growing faster than 



N 



Algorithm Selection 

There are several standard techniques for finding eigenvalues, see [12]. For this paper we imple- 
mented four algorithms and evaluated them with a number of criteria: (a) how much precision 
does the algorithm require to meet a desired level of precision in the output; (b) how fast is the 
calculation relative to the other algorithms; (c) how effectively can the algorithm be parallelized. 
We implemented the following algorithms using the GNU Multiple Precision ^ library. 

• Lanczos Algorithm as described by Stoer and Bulirsch [TS] p. 401 

• Householder's Algorithm as described by Stoer and Bulirsch [13] p. 391 (with minor correc- 
tions) 

• The Jacobi Method with Rutishauser's enhancements, as described by Parlett [12] pp. 189- 
196. 

• A direct approach with an LDL^ determinant algorithm and a Secant root finder, as described 
below (hereafter referred to as the Secant algorithm) 
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The algorithms were tested by varying the number of bits of precision (K) of the inputs and the 
computations to achieve a specific level of precision in the result. In the following table, Accuracy 
is the number of correct significant digits in the result. Run Time is in seconds. Factor is the 
number of times slower this algorithm is than the fastest for a given A'^. The algorithm with the 
most accuracy for the least run time will be the best choice for computing the smallest eigenvalue. 



Aigoritlim 


N 




K 


Accuracy 


Run Time 


Factor 


Secant 


iUU 


1 
1 


oUU 


DU 


u.yo 


1 n 

i.U 


Householder 


100 


1 


2490 


9 


3.80 


3.96 


Jacobi 


100 


1 


700 


58 


5.91 


6.15 


Lanczos 


100 


1 


206250 


36 


361.46 


376.50 


Secant 


200 


1 


1400 


90 


23.67 


1.0 


Householder 


200 


1 


5940 


59 


134.32 


5.67 


Jacobi 


200 


1 


1500 


24 


216.55 


9.15 


Secant 


300 


1 


1600 


60 


144.02 


1.0 


Householder 


300 


1 


10000 


53 


990.26 


6.88 


Jacobi 


300 


1 


2400 


8 


2027.72 


14.08 


Secant 


400 


1 


2100 


51 


603.84 


1.0 


Householder 


400 


1 


13250 


41 


4036.11 


6.68 


Jacobi 


400 


1 


3375 


48 


9888.59 


16.38 



One unexpected result was the poor performance of the Lanczos algorithm, which failed for 
N=100 with anything less than 205000 bits of precision. There are a number of variants of the 
basic Lanczos algorithm - some orthogonalize the new vector after each iteration and some stop 
and restart the process, however we did not explore these because it did not appear that any of 
them would reduce the precision requirements to less than that of the other three algorithms. 

From these results we can conclude that the fastest technique to compute the smallest eigenvalue 
under these conditions is the Secant algorithm. In the remainder of the paper, we show that it can 
be effectively parallelized. 



The Secant Algorithm 

The Secant algorithm can be used to find the smallest root of the characteristic polynomial, P{x). 
P{x) can be defined as either det(x/ — M) or dei{M — xl). We use the latter because it guarantees 
P(0) will be positive. The secant algorithm starts with two initial points xi and X2 which must be 
less than Ai. We choose xi to be a small negative value and X2 to be zero. The Secant algorithm 
then computes a sequence of XiS using the following recurrence relation: 

Xi Xi—\ , . 

Xi+l — Xi — r —. -r(Xi) 
P[Xi) - P{Xi_i) 

The computation is complete when the most significant digits (15 decimal digits in our case) of 
Xi have stabilized. Since the eigenvalues of M are all unique, the roots of P are all simple and the 
Secant algorithm is guaranteed to converge rapidly to Ai [3J. Also note, since P has no inflection 
points less than Ai, the slope of the Secant at Xi will always be less than P'{x) when Xj < x < Ai 
and therefore the Secant is guaranteed not to overshoot Ai. 

Finally, there is no need to solve for P, instead, we can evaluate P{x) directly by computing 
det(M — xl). Thus, for our matrices, the problem of finding the smallest eigenvalue reduces to that 
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of solving a sequence of determinants. 

Interval Verification 

After the secant method has converged to some x, the next step is to prove that x is in fact 
an eigenvalue of M. To do this, we do two checks, first we truncate x to 15 significant digits 
and compute det(M — xl) which must be greater than 0. Then we add 1 to the least significant 
digit of X and compute det(M — xl) again. This result must be less than 0. To eliminate the 
possibility that round-off errors have poisoned the result, we perform these checks using fixed point 
interval arithmetic. See, for example. Interval Analysis, R.E. Moore [11]. In general the verification 
algorithm requires far more precision than the Secant root finder. 

The LDL^ Determinant Algorithm 

There are several standard techniques to compute the determinant of a matrix. The fastest general 
methods all involve factoring the matrix in some way and then calculating the determinant from the 
factors. In our case, we can make use of the fact that M is symmetric and factor M — xl into a lower 
triangular matrix, L, a diagonal matrix, D, and the transpose, L^, such that M — xl = LDL^ . 
Because x < \i, M — xl will be positive definite and the factorization is essentially the same as a 
Cholesky factorization except that it avoids the square root operations. The LDL^ factorization 
has numerical stability comparable to Cholesky factorization. Once the matrix is factored, the 
determinant is simply the product of the elements on the main diagonal of D. 

To make the algorithm easy to parallelize, we use the 'submatrix' order [5] for the LDL^ 
algorithm which applies a column of the matrix to all the remaining columns to its right. Presented 
below are serial and parallel versions of the algorithm. For clarity they both are shown using double 
precision arithmetic. In the actual implementations the calculations are done using the GMP integer 
package with fixed-point numbers. The elements of M are stored with K bits after the decimal 
point, which is specified when the algorithm is started. The elements of C, nextC, CDivDiag, and 
nextCDivDiag are all stored with K/2 bits after the decimal point. 

Serial LDL^ Algorithm: 

double LDLTDeterminant (double M[] [] , double x) { 
int processingCol , row, col, N=size(M); 
double CDivDiag [N], determinant=l . ; 

f or(processingCol=0; processingCoKN; processingCol++) { 
M [processingCol] [processingCol] -= x; 

determinant = determinant * M [processingCol] [processingCol] ; 

// CDivDiag is set to the current column that we're processing divided by 

// the entry on the diagonal. 

f or (row=processingCol+l ; row<N; row++) 

CDivDiag [row] = M [row] [processingCol] / M [processingCol] [processingCol] ; 
f or (col=processingCol+l ; coKsize; col++) 

for(row=col; row<N; row++) 

M [row] [col] -= M [col] [processingCol] * CDivDiag [row] ; 
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} 

return determinant; 

} 

Parallel LDL^ Algorithm: 

The parallel version of the LDL^ algorithm is designed to use both MPI and OpenMP. MPI is used 
to distribute data between nodes and OpenMP is used to spread the computation across multiple 
cores. We chose this solution for two reasons. First, the inter-core communication overhead of 
OpenMP is much less than for MPI. Second, the majority of the computation time is spent in 
just a few loops that naturally parallelize with OpenMP. The data distribution is done using MPI 
broadcasts. These are run in a separate thread which allows the communication to be overlapped 
with the computation. 

The following pseudo-code has annotations on the right hand side indicating aspects of our 
timing analysis, for example, {c}. The algorithm tracks the total amount of time spent executing 
{c} steps, likewise for {net} and {d}. c is for the main computation, net is for the time spent 
waiting for the network 10 to complete, and d for the remaining divisions required to complete 
CDivDiag. In addition one more timer records the total time spent computing the determinants. 

void assign(int assignments [] ) { 

// This routine assigns the columns to MPI processes. A column is assigned when 
// assignments [col] =rank (the MPI rank of the process). M is the number of columns 
// S is MPI size. 
for(col=0; coKN; col++) {. 
rank = col 7. (2*S) 

assignments [col] = (rank<S) ? rank : (2*S)-1 - rank; 

} 

} 

void applyCint processingCol , double C[] , double CDivDiag [] , double[][] M, 
int assignments [] , int startCol, int endCol) { 
// This routine applies the column C to columns of M from start to end 
f or (col=startCol ; col <= endCol; col++) 
if (assigments [col] ==myRcink) 
for(row=col; row<N; row++) 

M[row] [col] -= C [processingCol] * CDivDiag [row] 

} 

void startBackgroundTransmit (Column C, Column CDivDiag) { 

// This routine is run by a background thread. It uses MPI_Bcast to 
// transmits the column C to all of the other MPI processes. 
// While there is time left in the foreground computation, 
// send the next chunk of the CDivDiag column. 

} 

void startBackgroundReceive (Column C, Column CDivDiag) { 
// Receive the C column. 

// Receive any values of the CDivDiag column that are sent. If the 
// computation is 10 bound, no values will be sent. If it is CPU 
// bound, all the CDivDiag values will be sent. 
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} 



int waitForBackgroundCommunicationO {. 

11 Wait until the background ID is finished 

// Return the number of CDivDiag entries sent/received 

} 

double determinant (double M[] [] , double x) { 

double C[N], CDivDiag [N] , nextC[N], nextCDivDiag [N] , determinant=1.0; 
int processingCol, row, col, CDivDiagReceived, assignments [N] ; 

// Compute the assignments 
assign(assignments) ; 

// Set M to M-xI 
for(row=0; row<N; row++) 
M[row] [row] -= x; 

// Pull out the first column, and compute CDivDiag 
copyColumn(M, 0, C) ; 
for(row=l; row<N; row++) 
CDivDiag=C [row] / C [0] ; 

For(processingCol=0; processingCoKM-1 ; processingCol++) { 

if (assignments [processingCol+1] == myRank) { 
// Apply C to column processingCol+1 

apply (processingCol, C, CDivDiag, M, assignments, processingCol+1 , {c} 
processingCol+1) ; 

// Pull the column out of M. This column is finished eind needs to be 
// sent to the other MPI processes. Compute nextCDivDiag. 
copyColumn(M, processingCol+1 , nextC) ; 
f or(row=processingCol+2; row<N; row++) 

nextCDivDiag [row] = nextC [row] / nextC [processingCol+1] ; 

startBackgroundTrcinsmit(nextC, nextCDivDiag) ; 

// Apply C (processingCol) to the remainder of M 

apply (processingCol, C, CDivDiag, M, assignments, processingCol+2, N-1) ; {c} 
waitForBackgroundCommunicationO ; {net} 



} 

else { 

startBackgroundReceive(nextC, nextCDivDiag) ; 



// Apply C to all remaining columns of M 

apply (processingCol , C, CDivDiag, M, assignments, processingCol+1 , N-1); 
CDivDiagReceived=waitForBackgroundCommunication() ; 



{c} 
{net} 



// Compute any nextCDivDiag entries not received 
f or (row=processingCol+2+CDivDiagReceived; row<N; row++) 
nextCDivDiag [row] = nextC [row] / nextC [processingCol+1] ; 



{d} 
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} 

swapCC, nextC) ; 
swapCCDivDiag, nextCDivDiag) ; 
deteniiinant=deterniinaiit * C [processingCol+1] ; 

} 

return determinant; 

} 

Numerical results 

The following table shows a sampling of results for the smallest eigenvalue computations. It clearly 
illustrates the wide range of values that the algorithm must handle. 



THE SMALLEST EIGENVALUE OF M 



N 


13=1/2, 


13=1/2 


13=1 


13=1/4. 


100 


3.4720 


2.7397x10" 


-1 


2.1079x10" 


-15 


1.6976x10" 


-45 


300 


3.3984 


1.5837x10" 


-1 


5.5215x10" 


-28 


1.4844x10" 


-102 


500 


3.3763 


1.2047x10" 


-1 


1.1138x10' 


-36 


6.7121x10" 


-149 


1000 


3.3544 


8.2087x10" 


-2 


1.0892x10" 


-52 


3.6209x10" 


-246 


1500 


3.3447 


6.6295x10" 


-2 


5.4593x10" 


-65 


6.4232x10" 


-330 



Performance 

The performance testing was done on the University of Massachusetts Swarm cluster, which has 
60 compute nodes, each with 8 cores (Xeon 5355 processors at 2.66 GHz) and 16 GB of RAM per 
node. The nodes are connected via gigabit ethernet. The cluster is partitioned and our tests were 
run on 48 nodes, each with 5 cores for a total of 240 cores. 

The results show the cumulative time spent running the determinant algorithm. The results 
do not include the startup time, the time to generate the M matrix, nor the time spent running 
the secant algorithm. The total for these tasks is less than 1% of the time spent computing 
determinants. 

In the tables and graphs below Total Time is the total time spent computing determinants. 
Computation is the cumulative time spent executing {c} steps in the LDL^ algorithm and Net + 
Divs is the cumulative time in the {net} and {d} steps. For these results, we add the {net} and 
{d} times together because they both represent time/computations wasted due to a lack network 
bandwidth. 
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CONSTANT BETA 





/3 = 7/4, N=1000 


/3 = 7/4, N=1500 


/3 = 7/4, N=2000 


80 
Cores 


Total Time 
Computation 
Net + Divs 


0:43:57 (2,638) 
92.2% (2,432) 
7.2% (190) 


4:04:31 (14,672) 
97.0% (14,231) 
2.8% (417) 


13:57:35 (50,255) 
98.4% (49,438) 
1.5% (773) 


160 
Cores 


Total Time 
Computation 
Net + Divs 


No results 
available 


2:04:00 (7,440) 
93.6% (6,965) 
6.1% (452) 


7:05:11 (25,511) 
96.9% (24,727) 
3.0% (752) 


240 
Cores 


Total Time 
Computation 
Net + Divs 


0:17:33 (1054) 
77.0% (811) 
21.5% (227) 


1:24:53 (5,094) 
91.2% (4,645) 
8.4% (426) 


4:45:25 (17,125) 
96.2% (16,466) 
3.7% (630) 



The table shows that as N increases, the percentage of time spent doing useful computation 
increases. 



Figure 1: Constant Beta 



CONSTANT N 





/3 = 1/3, N=1500 


f3 = 1/2, N=1500 


(3 = 1, N=1500 


f3 = 7/4, N=1500 


80 
Cores 


Total Time 
Computation 
Net + Divs 


8:29:45 (30,585) 
88.5% (27,075) 
11.4% (3,483) 


6:38:32 (23,912) 
87.3% (20,868) 
12.6% (3,013) 


3:59:40 (14,380) 
85.4% (12,277) 
14.4% (2,076) 


4:04:31 (14,672) 
97.0% (14,231) 
2.8% (417) 


160 
Cores 


Total Time 
Computation 
Net + Divs 


6:47:22 (24,442) 
69.7% (17,031) 
30.2% (7,388) 


4:10:38 (15,038) 
68.6% (10,308) 
31.3% (4,699) 


3:26:40 (12,400) 
46.5% (5,761) 
53.3% (6,613) 


2:04:00 (7,440) 
93.6% (6,965) 
6.0% (452) 


240 
Cores 


Total Time 
Computation 
Net + Divs 


4:21:12 (15,673) 
54.7% (8,575) 
45.1% (7,073) 


3:38:29 (13,109) 
49.3% (6,469) 
50.4% (6,609) 


2:23:07 (8,588) 
43.9% (3,767) 
55.8% (4,793) 


1:24:53 (5,094) 
91.2% (4,645) 
8.4% (426) 



This table shows two trends. First, as /3 moves away from 1, the computation becomes more 
efficient. That is, the percentage of time spent doing computations increases and the percentage of 
time spend waiting on network 10 and divisions decreases. Second, if the computation is network 
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Figure 2: Constant N 



bound, increasing the number of cores makes the computation substantially less efficient. However, 
if the computation is CPU bound, then increasing the number of cores makes the computation only 
marginally less efficient. In fact, the amount of time wasted waiting for network 10 and divisions is 
almost constant. The percentage of wasted time increases only because the overall time to complete 
the computation decreases. In other words, the algorithm scales almost perfectly when it is CPU 
bound. 

Communication 

There are a couple of interesting things to note about the algorithm. First, it works by preparing 
a column, starting a background broadcast and then applying the column to the remainder of the 
matrix. As the matrix size grows, or the precision in the numbers grows, it becomes easier and 
easier to fully overlap the computation and communication. This means that, with a large enough 
problem, network latency has minimal impact on performance, which is dominated by network 
throughput. Given sufficient network throughput, the algorithm can scale to very large systems. 

Second, as grows, the precision (K) required to perform the computation increases. In the 
inner loop of the computation two numbers, each with K/2 bits are being multiplied. Increasing 
K thus increases the processing time by approximately 0{K^), while the communication time 
increases only linearly. Therefore, as N grows, there is a very powerful effect on the efficiency and 
scalability of the algorithm. 

Conclusion 

Large ill-conditioned Hankel matrices present an unusual mix of computation that are not readily 
solved with traditional approaches. We have explored a space of potential algorithmic solutions to 
determine which approach provides the greatest efficiency given the special precision requirements 
of the problem. Surprisingly, we found that a direct method is more effective than the more sophis- 
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ticated algorithms that are commonly employed. Our parallel implementation of this algorithm 
takes into account the atypically large computation time required for individual operations, and 
uses both shared memory and message passing to optimize performance. The result is a parallel 
implementation that scales nearly perfectly, and that can be made nearly insensitive to network la- 
tency while taking maximum advantage of network throughput. The algorithms we developed have 
thus proved to be an elegant, efficient, fast and scalable solution to the problem, with guaranteed 
numeric results. As a result, we have been able to considerably extend the known set of solutions 
for these matrices using a modest Beowulf cluster, and have shown that much larger instances 
should be easy to solve on systems with a higher degree of parallelism. 
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